MUSA 507 Final Project: Scooter Data Exploration and Demand Prediction
Miaoyan Zhang, Yichen Lei
library(tidyverse)
library(sf)
library(lubridate)
library(tigris)
library(tidycensus)
library(gganimate)
library(viridis)
library(riem)
library(gridExtra)
library(knitr)
library(kableExtra)
library(ggplot2)
library(ggpubr)
mapTheme <- function(base_size = 12) {
theme(
text = element_text( color = "black"),
plot.title = element_text(size = 14,colour = "black"),
plot.subtitle=element_text(face="italic"),
plot.caption=element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_blank(),axis.title = element_blank(),
axis.text = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_rect(colour = "black", fill=NA, size=2)
)
}
plotTheme <- function(base_size = 12) {
theme(
text = element_text( color = "black"),
plot.title = element_text(size = 14,colour = "black"),
plot.subtitle = element_text(face="italic"),
plot.caption = element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_blank(),
panel.grid.major = element_line("grey80", size = 0.1),
panel.grid.minor = element_blank(),
panel.border = element_rect(colour = "black", fill=NA, size=2),
strip.background = element_rect(fill = "grey80", color = "white"),
strip.text = element_text(size=12),
axis.title = element_text(size=12),
axis.text = element_text(size=10),
plot.background = element_blank(),
legend.background = element_blank(),
legend.title = element_text(colour = "black", face = "italic"),
legend.text = element_text(colour = "black", face = "italic"),
strip.text.x = element_text(size = 14)
)
}
palette7 <- c("#FE7568","#FCCA6C","#548FCC","#C85B6C","#315B8A","#FFE682","#FF69AD")
palette5 <- c("#eff3ff","#bdd7e7","#6baed6","#3182bd","#08519c")
palette4 <- c("#D2FBD4","#92BCAB","#527D82","#123F5A")
palette3 <- c("#6baed6","#3182bd","#08519c")
palette2 <- c("#6baed6","#08519c")
qBr <- function(df, variable, rnd) {
if (missing(rnd)) {
as.character(quantile(round(df[[variable]], 0),
c(.01, .2, .4, .6, .8), na.rm = TRUE))
} else if (rnd == FALSE) {
# unrounded version: same quantiles, formatted to three significant digits
as.character(formatC(quantile(df[[variable]],
c(.01, .2, .4, .6, .8), na.rm = TRUE),
digits = 3))
}
}
q5 <- function(variable) {as.factor(ntile(variable, 5))}

1 Introduction
Over the past decade, shared active transportation systems have become a common sight on North American streets and rights-of-way, creating new travel opportunities and changing the way people move through cities. Within shared transportation, the main function of shared scooters is to help people complete the last mile of a trip. The last-mile problem refers to the difficulty of moving passengers between private homes and mass transit facilities such as bus stops and train stations. Shared scooters offer a solution to this problem because they are lightweight, communal, and designed for short trips; as they grow in popularity, they can increase the use of public transportation and reduce reliance on private vehicles. Electric scooters are usually “dockless”, meaning they have no fixed station locations; instead, riders can pick them up or drop them off anywhere within the service area.
The downside of this dockless convenience is that management and rebalancing become extremely challenging. It is difficult for scooter companies to accurately estimate demand for, and returns of, scooters from short-term information, and then to rebalance scooters across service areas appropriately. The common industry approach is to allocate scooters somewhat randomly and infer net demand by observing whether the scooters get used. This trial-and-error method takes a long time to converge on a workable allocation and is not sensitive to short-term changes; for example, it is hard to adjust the allocation in response to changing weather.
Using roughly 430,000 scooter trips in Louisville, this project proposes a data-driven space-time model to predict the net demand for scooters across the service area. The underlying assumption is that variation in scooter trips is a function of historical temporal and spatial ride-share patterns. Through time lags and spatial lags, the model uses historical scooter behavior as an effective predictor of ridership. Because scooters are dockless, we aggregate locations into a fishnet of grid cells, which also mitigates the low positional accuracy introduced by privacy protection. In addition, the model accounts for the effects of weather and surrounding facilities on scooter trips. Finally, we apply the space-time model to predict trip starts and trip ends separately, then subtract the number of starts from the number of ends in each fishnet cell to obtain the net difference, which represents the surplus of scooters in that area. The net difference map reflects how scooters naturally flow as people travel. Rebalancing means deploying scooters from areas with a surplus to areas with a shortage, so the predicted net difference map is directly valuable for rebalancing plans.
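The net-difference bookkeeping described above can be sketched in a few lines; the cell IDs and counts below are made up purely for illustration:

```r
library(dplyr)

# Hypothetical trip counts per fishnet cell (toy values, not project data)
starts <- data.frame(cellID = c("A", "B", "C"), n_starts = c(10, 4, 7))
ends   <- data.frame(cellID = c("A", "B", "C"), n_ends   = c(6, 9, 7))

# Net difference = ends - starts: positive cells accumulate scooters,
# negative cells are drained and need scooters redeployed to them
net_diff <- starts %>%
  inner_join(ends, by = "cellID") %>%
  mutate(net = n_ends - n_starts)
```

In the real panel the same subtraction is done per cell per hour, using the predictions from the two fitted models rather than toy counts.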
Based on this model, we designed two applications for different users. The first is a web app for scooter companies: through it, a company can easily see the daily net difference for each fishnet cell in the service area along with a recommended rebalancing plan, while the exploratory analysis results help users gain actionable insight into how demand changes over time and space for better resource allocation. The second is a mobile app for riders. In addition to basic functions such as finding scooters, tracking routes, and dynamic payment, a reward mechanism can be built on the predicted net difference map: by rewarding riders who travel from predicted scooter-rich areas to scooter-scarce areas, the system can partially rebalance itself.
2 Data Wrangling
In this study, we mainly used four types of data to build the final spatiotemporal panel:
Louisville Scooter Trips Data: collected from Louisville Open Data. It contains all shared scooter trips since August 2018, roughly 430,000 trip records in total. Because the scooter fleet was rolled out gradually, trips are sparse in the early period when few scooters were deployed. For this project, we use the stable, high-volume period between June 10th and August 5th, 2019. The data has 15 attributes, such as start time, end time, start longitude, end latitude, day of week, and hour number.
Louisville Service Distribution Zones: collected from Louisville Open Data. It contains nine service zones in Louisville, such as Northwest Core, Downtown, University, and Iroquois Park. The scooter company allocates and rebalances scooters based on these nine areas. It contains zone codes and zone descriptions.
Louisville Weather Data: collected from Louisville Muhammad Ali International Airport. The airport has professional weather measurement equipment, and its data can represent the local weather in Louisville. It contains three weather attributes: temperature, precipitation, and wind speed.
Census Data: collected through tidycensus. It contains the census tracts and corresponding socioeconomic attributes, such as total population, median household income, white population, and means of transportation to work. Through a spatial join, census tracts that coincide with the service area are selected.
City Facilities: collected from Louisville Open Data. Because scooters target last-mile trips, we expect bus stops, police stations, and right-of-way construction permits to be closely related to scooter trips. The spatial locations of these urban facilities are the main attributes used in this research.
2.1 Trips data
After reading the raw trip data, the original date and time columns are combined into datetimes, and lubridate's floor_date is used to create standardized time attributes at 15-minute and 60-minute intervals; week of year and day of week are also derived for later analysis.
ride <- read.csv("C:/Documents/Upenn/2019fall/musa507/Final_project/Data/DocklessTripOpenData_7.csv")
ride$Start_Datetime = as.POSIXct(paste(ride$StartDate, ride$StartTime), format="%Y-%m-%d %H:%M")
ride$End_Datetime = as.POSIXct(paste(ride$EndDate, ride$EndTime), format="%Y-%m-%d %H:%M")
ride <- na.omit(ride)
ride2 <- ride %>%
mutate(interval60 = floor_date(ymd_hms(Start_Datetime), unit = "hour"),
interval15 = floor_date(ymd_hms(Start_Datetime), unit = "15 mins"),
week = week(ymd_hms(interval60)+60*60*24),
dotw = wday(interval60, label=TRUE))

2.2 Weather data
As light vehicles, scooters offer no shelter from wind or rain, so scooter trips are easily affected by the weather. After downloading the airport data for the study period with the riem_measures function, we aggregated the observations by hour to create a weather panel with one row per hour.
The following figure shows the three weather attributes over one year. Precipitation is relatively stable throughout the year, with summer precipitation slightly higher than in other seasons. Wind speed is highly random, with little evident periodicity or regularity. Temperature, by contrast, shows strong seasonality: before April it was generally below 75 degrees Fahrenheit and could drop to 0 degrees in cold winter spells, while during our research period in June and July it was stable above 75 degrees Fahrenheit, a very suitable environment for riding scooters.
weather.Panel <-
riem_measures(station = "SDF", date_start = "2018-08-09", date_end = "2019-10-31") %>%
replace(is.na(.), 0) %>%
mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
mutate(week = week(ymd_hms(interval60)+60*60*24),
dotw = wday(interval60, label=TRUE)) %>%
group_by(interval60) %>%
summarize(Temperature = max(tmpf),
Precipitation = sum(p01i),
Wind_Speed = max(sknt)) %>%
mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))
glimpse(weather.Panel)

## Observations: 10,752
## Variables: 4
## $ interval60 <dttm> 2018-08-09 00:00:00, 2018-08-09 01:00:00, 2018-08-09 0…
## $ Temperature <dbl> 81.0, 80.1, 79.0, 79.0, 78.1, 77.0, 77.0, 77.0, 75.9, 7…
## $ Precipitation <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Wind_Speed <dbl> 10, 8, 6, 7, 8, 8, 7, 6, 6, 5, 6, 5, 3, 3, 5, 9, 7, 10,…
grid.arrange(
ggplot(weather.Panel, aes(interval60,Precipitation)) + geom_line() +
labs(title="Precipitation", x="Hour", y="Precipitation") + plotTheme(),
ggplot(weather.Panel, aes(interval60,Wind_Speed)) + geom_line() +
labs(title="Wind Speed", x="Hour", y="Wind Speed") + plotTheme(),
ggplot(weather.Panel, aes(interval60,Temperature)) + geom_line() +
labs(title="Temperature", x="Hour", y="Temperature") + plotTheme(),
top="Weather Data - Louisville SDF - Aug 2018 to Oct 2019")

2.3 Census data
The census tracts of the Greater Louisville area and their corresponding socioeconomic attributes are downloaded through tidycensus. Besides the directly extracted attributes, some new ones are created through simple feature engineering, such as the share of white residents and mean commute time. The tract geometries are extracted and shown in the figure below.
census_api_key("772465bff3e908b0e7f1bdad945c1562bb0042b6", overwrite = FALSE, install = FALSE)
LouisvilleCensus <-
get_acs(geography = "tract",
variables = c("B01003_001", "B19013_001",
"B02001_002", "B08013_001",
"B08012_001", "B08301_001",
"B08301_010", "B01002_001"),
year = 2017,
state = "21",
geometry = TRUE,
county=c("111"),
output = "wide") %>%
rename(Total_Pop = B01003_001E,
Med_Inc = B19013_001E,
Med_Age = B01002_001E,
White_Pop = B02001_002E,
Travel_Time = B08013_001E,
Num_Commuters = B08012_001E,
Means_of_Transport = B08301_001E,
Total_Public_Trans = B08301_010E) %>%
select(Total_Pop, Med_Inc, White_Pop, Travel_Time,
Means_of_Transport, Total_Public_Trans,
Med_Age,
GEOID, geometry) %>%
mutate(Percent_White = White_Pop / Total_Pop,
Mean_Commute_Time = Travel_Time / Total_Public_Trans,
Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport)

Cols = c("Percent_White", "Mean_Commute_Time", "Percent_Taking_Public_Trans")
options(tigris_use_cache = TRUE)
LouisvilleTracts <-
LouisvilleCensus %>%
as.data.frame() %>%
distinct(GEOID, .keep_all = TRUE) %>%
select(GEOID, geometry) %>%
st_sf %>%
st_transform(crs=4326)
ggplot() +
geom_sf(data=LouisvilleTracts) +
labs(title = "Louisville census tracts") +
mapTheme()

2.4 Using scooter service area as study area
Compared to the Greater Louisville area, the scooter service area occupies only a small portion of the north-central part of the county; this is the study area of the project. As the following figure shows, the census tracts are clipped with st_intersection to fit the study area.
Service_Area <- st_read(
"C:/Documents/Upenn/2019fall/musa507/Final_project/Data/Dockless Vehicle Service Area/Dockless_Vehicle_Service_Area.shp") %>%
st_transform(crs=4326)
tracts.StudyArea <-
st_intersection(LouisvilleTracts, st_union(Service_Area)) %>%
distinct(., GEOID, .keep_all = TRUE)

ggplot() +
geom_sf(data=LouisvilleTracts) +
geom_sf(data=tracts.StudyArea, fill = "blue") +
labs(title = "Study Area Tracts") +
mapTheme()

2.5 Using fishnet as unit
Because scooters are dockless, we use a fishnet to aggregate their scattered locations into a lattice of grid cells, which represents a smooth spatial surface of locations. Another reason is that the latitude and longitude in the Louisville Open Data have been rounded to three decimal places to protect user privacy; the fishnet simply ignores exact point locations and reduces the impact of this limited precision.
Choosing the fishnet's grid cell size is a key decision in the study. We experimented with multiple cell sizes, eventually balancing the size of the spatial panel against interpretive accuracy, and settled on a cell size of 0.005 degrees. This yields approximately 900 fishnet cells in the study area.
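For intuition about what a 0.005-degree cell means on the ground, here is a rough back-of-the-envelope sketch, assuming roughly 111.32 km per degree of latitude and Louisville at about 38.25° N (both figures are approximations, not project data):

```r
# Approximate metric size of one 0.005-degree grid cell near Louisville
lat      <- 38.25            # approximate latitude of the study area
cell_deg <- 0.005
cell_ns_m <- cell_deg * 111320                        # north-south, ~557 m
cell_ew_m <- cell_deg * 111320 * cos(lat * pi / 180)  # east-west, ~437 m
```

So each cell covers roughly half a kilometer on a side, a plausible walking radius for a dockless pickup.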
fishnet <-
st_make_grid(tracts.StudyArea, cellsize = 0.005) %>%
st_sf()
ggplot() +
geom_sf(data=tracts.StudyArea, fill=NA, colour="black",size = 1) +
geom_sf(data=fishnet, fill=NA, colour="black") +
labs(title = "Louisville and the fishnet")

fishnet <-
fishnet[tracts.StudyArea,] %>%
mutate(uniqueID = rownames(.)) %>%
dplyr::select(uniqueID)
ggplot() +
geom_sf(data=fishnet) +
labs(title = "Fishnet in Louisville Scooter Service Area") +
mapTheme()

LouisvilleCensus <-
LouisvilleCensus%>%
st_transform(crs = 4326)
fishnet_tract <-
st_centroid(fishnet) %>%
st_join(., dplyr::select(LouisvilleCensus, GEOID)) %>%
st_set_geometry(NULL) %>%
left_join(dplyr::select(fishnet, geometry, uniqueID)) %>%
st_sf() %>%
na.omit()
dplyr::select(fishnet_tract, GEOID) %>%
gather(Variable, Value, -geometry) %>%
ggplot() +
geom_sf(aes(fill = Value)) +
facet_wrap(~Variable) +
scale_fill_viridis(discrete = TRUE) +
labs(title = "Fishnet of different tracts") +
mapTheme() + theme(legend.position = "none")

2.6 Other city facilities data and feature engineering
We selected three kinds of urban facilities that could affect scooter travel: bus stops, police stations, and right-of-way construction permits. Bus stops are closely related to the last mile: people ride scooters from work or home to a nearby stop to board public transportation. Police stations proxy for neighborhood safety: people rarely choose to ride scooters into crime-prone areas, and conversely, scooters near a police station offer fast transportation under safe conditions. Finally, right-of-way construction affects the roads scooters travel on; the permits show where that construction is occurring, which is useful road information.
Bus_stops <-
st_read("https://opendata.arcgis.com/datasets/1af0d14cc5ce41b292ec688ccef05e56_0.geojson") %>%
st_transform(st_crs(fishnet)) %>%
dplyr::select(geometry) %>%
mutate(Legend = "bus_stops")
Police_station <-
st_read("https://opendata.arcgis.com/datasets/463cb51aeb63468f8e5378754b7764c7_33.geojson") %>%
st_transform(st_crs(fishnet)) %>%
dplyr::select(geometry) %>%
mutate(Legend = "police_station")
Construction_Permits<-
st_read("https://opendata.arcgis.com/datasets/0cfecd8bf248488a86f1ce804bb9af4b_0.geojson") %>%
st_transform(st_crs(fishnet)) %>%
dplyr::select(geometry) %>%
mutate(Legend = "construction_Permits")

grid.arrange(
ggplot() +
geom_sf(data = LouisvilleCensus) +
geom_sf(data = Bus_stops, colour="blue", size=0.1, show.legend = "point") +
labs(title= "Louisville Bus stops") +
mapTheme(),
ggplot() +
geom_sf(data = LouisvilleCensus) +
geom_sf(data = Police_station, colour="red", size=2, show.legend = "point") +
labs(title= "Louisville Police stations") +
mapTheme(),
ggplot() +
geom_sf(data = LouisvilleCensus) +
geom_sf(data = Construction_Permits, colour="orange", size=1.5, show.legend = "point") +
labs(title= "Louisville Right-of-Way Construction Permits") +
mapTheme(), ncol = 3)

Different feature engineering methods are used. We built an nn_function to analyze the spatial information of these city facilities. For each fishnet cell, the average distance from its centroid to the nearest n facilities is calculated; this reduces error and more accurately reflects the spatial relationship between the facilities and the scooters' locations. We also count the number of each facility type within every fishnet cell. A correlation comparison of these two methods is made in the next chapter.
library(FNN)
nn_function <- function(measureFrom,measureTo,k) {
measureFrom_Matrix <-
as.matrix(measureFrom)
measureTo_Matrix <-
as.matrix(measureTo)
nn <-
get.knnx(measureTo, measureFrom, k)$nn.dist
output <-
as.data.frame(nn) %>%
rownames_to_column(var = "thisPoint") %>%
gather(points, point_distance, V1:ncol(.)) %>%
arrange(as.numeric(thisPoint)) %>%
group_by(thisPoint) %>%
summarize(pointDistance = mean(point_distance)) %>%
arrange(as.numeric(thisPoint)) %>%
dplyr::select(-thisPoint) %>%
pull()
return(output)
}
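To see what nn_function computes, here is a self-contained toy example that calls FNN::get.knnx directly with made-up planar coordinates (the matrices below are illustrative, not project data):

```r
library(FNN)

from <- matrix(c(0, 0,
                 1, 0), ncol = 2, byrow = TRUE)  # e.g. two fishnet centroids
to   <- matrix(c(0, 1,
                 0, 2,
                 0, 3), ncol = 2, byrow = TRUE)  # e.g. three facilities

# Distance from each 'from' point to its 2 nearest 'to' points, then averaged:
# this is the same quantity nn_function returns for k = 2
nn_dist  <- get.knnx(to, from, k = 2)$nn.dist
avg_dist <- rowMeans(nn_dist)
```

The first point sits at distance 1 and 2 from its two nearest facilities, so its average is 1.5; averaging over k neighbors rather than taking only the single nearest smooths out the influence of any one facility.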
vars_net <-
rbind(Bus_stops,Police_station,Construction_Permits) %>%
st_join(., fishnet, join=st_within) %>%
st_set_geometry(NULL) %>%
group_by(uniqueID, Legend) %>%
summarize(count = n()) %>%
full_join(fishnet) %>%
spread(Legend, count, fill=0) %>%
st_sf() %>%
dplyr::select(-`<NA>`) %>%
na.omit()
vars_net$bus_stops.nn =
nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(Bus_stops), 3)
vars_net$police_station.nn =
nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(Police_station), 1)
vars_net$construction_Permits.nn =
nn_function(st_coordinates(st_centroid(vars_net)), st_coordinates(Construction_Permits), 3)

2.7 Create the final space/time panel
The space-time panel contains every space-time combination. In this project, we built panels for trip starts and trip ends separately. The panel's spatial units are the fishnet cells, and it is initialized with the expand.grid function: each row is a unique combination of a cell and an hour. For each row, the scooter trip counts at that place and time are tallied to fill the panel. The next step is to join the other data: weather data, socioeconomic data from the census tracts, and the n-nearest facility features are all left-joined onto the panel.
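A minimal illustration of the expand.grid initialization, using two toy hours and three toy cell IDs:

```r
# Every hour crossed with every fishnet cell ID: one panel row per combination
hours <- as.POSIXct(c("2019-06-10 00:00:00", "2019-06-10 01:00:00"), tz = "UTC")
cells <- c("1", "2", "3")

panel <- expand.grid(interval60 = hours, fishnetID = cells)
nrow(panel)  # 2 hours x 3 cells = 6 rows, all empty until trips are counted
```

Crossing every cell with every hour guarantees that hours with zero trips still appear in the panel, so the model sees true zero counts instead of missing rows.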
ride_fishnet <- st_join(ride2 %>%
filter(is.na(StartLongitude) == FALSE &
is.na(StartLatitude) == FALSE &
is.na(EndLongitude) == FALSE &
is.na(EndLatitude) == FALSE) %>%
st_as_sf(., coords = c("StartLongitude", "StartLatitude"), crs = 4326),
fishnet %>%
st_transform(crs=4326),
join=st_intersects,
left = TRUE) %>%
rename(Origin.fishnetID = uniqueID) %>%
mutate(StartLongitude = unlist(map(geometry, 1)),
StartLatitude = unlist(map(geometry, 2)))%>%
as.data.frame() %>%
select(-geometry)%>%
st_as_sf(., coords = c("EndLongitude", "EndLatitude"), crs = 4326) %>%
st_join(., fishnet %>%
st_transform(crs=4326),
join=st_intersects,
left = TRUE) %>%
rename(Destination.fishnetID = uniqueID) %>%
mutate(EndLongitude = unlist(map(geometry, 1)),
EndLatitude = unlist(map(geometry, 2)))%>%
as.data.frame() %>%
select(-geometry)
ride_fishnet <- na.omit(ride_fishnet)
ride_fishnet <- st_join(ride_fishnet %>%
filter(is.na(StartLongitude) == FALSE &
is.na(StartLatitude) == FALSE &
is.na(EndLongitude) == FALSE &
is.na(EndLatitude) == FALSE) %>%
st_as_sf(., coords = c("StartLongitude", "StartLatitude"), crs = 4326),
LouisvilleTracts %>%
st_transform(crs=4326),
join=st_intersects,
left = TRUE) %>%
rename(Origin.Tract = GEOID) %>%
mutate(StartLongitude = unlist(map(geometry, 1)),
StartLatitude = unlist(map(geometry, 2)))%>%
as.data.frame() %>%
select(-geometry)%>%
st_as_sf(., coords = c("EndLongitude", "EndLatitude"), crs = 4326) %>%
st_join(., LouisvilleTracts %>%
st_transform(crs=4326),
join=st_intersects,
left = TRUE) %>%
rename(Destination.Tract = GEOID) %>%
mutate(EndLongitude = unlist(map(geometry, 1)),
EndLatitude = unlist(map(geometry, 2)))%>%
as.data.frame() %>%
select(-geometry)
ride_fishnet <- na.omit(ride_fishnet)
ride.template <- ride_fishnet
study.panel.all <-
expand.grid(interval60 = unique(ride.template$interval60[ride.template$Start_Datetime >="2019-06-10 00:00:00" & ride.template$Start_Datetime<"2019-08-05 00:00:00"]),
fishnetID = unique(fishnet$uniqueID))
ride.template$Origin.fishnetID <- as.factor(ride.template$Origin.fishnetID)
ride.template$Destination.fishnetID <- as.factor(ride.template$Destination.fishnetID)
ride.panel.start <-
subset(ride.template, Start_Datetime >="2019-06-10 00:00:00" & Start_Datetime<"2019-08-05 00:00:00") %>%
mutate(Trip_Counter = 1) %>%
right_join(study.panel.all, by=c("Origin.fishnetID" = "fishnetID", "interval60"= "interval60")) %>%
group_by(interval60, Origin.fishnetID) %>%
summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
left_join(weather.Panel) %>%
left_join(fishnet_tract, by=c("Origin.fishnetID" = "uniqueID")) %>%
left_join(st_set_geometry(LouisvilleCensus, NULL), by=c("GEOID"="GEOID")) %>%
left_join(st_set_geometry(tracts.StudyArea, NULL), by = c("Origin.fishnetID" = "GEOID")) %>%
left_join(st_set_geometry(vars_net, NULL), by = c("Origin.fishnetID" = "uniqueID")) %>%
ungroup() %>%
mutate(week =week(ymd_hms(interval60)+60*60*24),
dotw = wday(interval60, label = TRUE)) %>%
st_sf()
ride.panel.start[is.na(ride.panel.start)] <- 0
ride.panel.end <-
subset(ride.template, Start_Datetime >="2019-06-10 00:00:00" & Start_Datetime<"2019-08-05 00:00:00") %>%
mutate(Trip_Counter = 1) %>%
right_join(study.panel.all, by=c("Destination.fishnetID" = "fishnetID", "interval60")) %>%
group_by(interval60, Destination.fishnetID) %>%
summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
left_join(weather.Panel) %>%
left_join(fishnet_tract, by=c("Destination.fishnetID" = "uniqueID")) %>%
left_join(st_set_geometry(LouisvilleCensus, NULL), by=c("GEOID"="GEOID")) %>%
left_join(st_set_geometry(tracts.StudyArea, NULL), by = c("Destination.fishnetID" = "GEOID")) %>%
left_join(st_set_geometry(vars_net, NULL), by = c("Destination.fishnetID" = "uniqueID")) %>%
ungroup() %>%
mutate(week = week(ymd_hms(interval60)+60*60*24),
dotw = wday(interval60, label = TRUE)) %>%
st_sf()
ride.panel.end[is.na(ride.panel.end)] <- 0

2.8 Time lags
The time lag contains information for a certain length of time before the research point, and it can reflect the trend of the time series. In this study, we created different time lags of 1, 2, 3, 4, 12, and 24 hours, and selected significant ones as model predictors.
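A minimal illustration (toy counts) of how dplyr::lag produces these features within a single cell's hourly series:

```r
library(dplyr)

toy <- data.frame(hour = 1:5, Trip_Count = c(2, 0, 3, 1, 4))

# Each row receives the counts from one and two hours earlier; the leading
# rows are NA because no earlier observation exists
toy <- toy %>%
  mutate(lagHour   = lag(Trip_Count, 1),
         lag2Hours = lag(Trip_Count, 2))
```

In the real panels the data are first arranged by cell ID and hour, so each lag refers to the same cell's earlier counts.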
ride.panel.start <-
ride.panel.start %>%
arrange(Origin.fishnetID, interval60) %>%
mutate(lagHour = dplyr::lag(Trip_Count,1),
lag2Hours = dplyr::lag(Trip_Count,2),
lag3Hours = dplyr::lag(Trip_Count,3),
lag4Hours = dplyr::lag(Trip_Count,4),
lag12Hours = dplyr::lag(Trip_Count,12),
lag1day = dplyr::lag(Trip_Count,24),
# flag both holiday dates in a single test so neither assignment is overwritten
holiday = as.factor(ifelse(yday(interval60) %in% c(326, 358), 1, 0))) %>%
mutate(day = yday(interval60)) %>%
mutate(holidayLag = case_when(dplyr::lag(holiday, 1) == 1 ~ "PlusOneDay",
dplyr::lag(holiday, 2) == 1 ~ "PlusTwoDays",
dplyr::lag(holiday, 3) == 1 ~ "PlusThreeDays",
dplyr::lead(holiday, 1) == 1 ~ "MinusOneDay",
dplyr::lead(holiday, 2) == 1 ~ "MinusTwoDays",
dplyr::lead(holiday, 3) == 1 ~ "MinusThreeDays"),
holidayLag = replace_na(holidayLag, "0"))
dplyr::select(st_set_geometry(ride.panel.start, NULL),
interval60, Origin.fishnetID, Trip_Count,
lagHour, lag2Hours)

## # A tibble: 1,094,256 x 5
## interval60 Origin.fishnetID Trip_Count lagHour lag2Hours
## * <dttm> <chr> <dbl> <dbl> <dbl>
## 1 2019-06-10 00:00:00 1 0 NA NA
## 2 2019-06-10 01:00:00 1 0 0 NA
## 3 2019-06-10 02:00:00 1 0 0 0
## 4 2019-06-10 03:00:00 1 0 0 0
## 5 2019-06-10 04:00:00 1 0 0 0
## 6 2019-06-10 05:00:00 1 0 0 0
## 7 2019-06-10 06:00:00 1 0 0 0
## 8 2019-06-10 07:00:00 1 0 0 0
## 9 2019-06-10 08:00:00 1 0 0 0
## 10 2019-06-10 09:00:00 1 0 0 0
## # … with 1,094,246 more rows
ride.panel.end <-
ride.panel.end %>%
arrange(Destination.fishnetID, interval60) %>%
mutate(lagHour = dplyr::lag(Trip_Count,1),
lag2Hours = dplyr::lag(Trip_Count,2),
lag3Hours = dplyr::lag(Trip_Count,3),
lag4Hours = dplyr::lag(Trip_Count,4),
lag12Hours = dplyr::lag(Trip_Count,12),
lag1day = dplyr::lag(Trip_Count,24),
# flag both holiday dates in a single test so neither assignment is overwritten
holiday = as.factor(ifelse(yday(interval60) %in% c(326, 358), 1, 0))) %>%
mutate(day = yday(interval60)) %>%
mutate(holidayLag = case_when(dplyr::lag(holiday, 1) == 1 ~ "PlusOneDay",
dplyr::lag(holiday, 2) == 1 ~ "PlusTwoDays",
dplyr::lag(holiday, 3) == 1 ~ "PlusThreeDays",
dplyr::lead(holiday, 1) == 1 ~ "MinusOneDay",
dplyr::lead(holiday, 2) == 1 ~ "MinusTwoDays",
dplyr::lead(holiday, 3) == 1 ~ "MinusThreeDays"),
holidayLag = replace_na(holidayLag, "0"))
dplyr::select(st_set_geometry(ride.panel.end, NULL),
interval60, Destination.fishnetID, Trip_Count,
lagHour, lag2Hours)

## # A tibble: 1,094,256 x 5
## interval60 Destination.fishnetID Trip_Count lagHour lag2Hours
## * <dttm> <chr> <dbl> <dbl> <dbl>
## 1 2019-06-10 00:00:00 1 0 NA NA
## 2 2019-06-10 01:00:00 1 0 0 NA
## 3 2019-06-10 02:00:00 1 0 0 0
## 4 2019-06-10 03:00:00 1 0 0 0
## 5 2019-06-10 04:00:00 1 0 0 0
## 6 2019-06-10 05:00:00 1 0 0 0
## 7 2019-06-10 06:00:00 1 0 0 0
## 8 2019-06-10 07:00:00 1 0 0 0
## 9 2019-06-10 08:00:00 1 0 0 0
## 10 2019-06-10 09:00:00 1 0 0 0
## # … with 1,094,246 more rows
3 Exploratory Analysis
In this section, we first analyze the variation of the dependent variable, scooter trips, across time and space. We then examine the relationship between weather and scooter ridership. For each selected socioeconomic factor, we plot its relationship with the dependent variable. Finally, the three urban facility types and their n-nearest distances are analyzed spatially, and their relationships with the dependent variable are calculated.
3.1 Trips variation across time
Scooter use follows time series at several scales, such as week of year, day of week, and time of day. First, we can view all trips since 2018. In January 2019 there was a pronounced dip in scooter trips, likely because the winter weather was too cold and scooters are difficult to ride on snow. Trips peaked in the summer of 2019, when the weather suited scooters and the fleet had already grown large. After that, ridership gradually declined, probably as the weather cooled again.
ggplot(ride.template %>%
group_by(interval60) %>%
tally())+
geom_line(aes(x = interval60, y = n))+
labs(title="Scooter trips per hr. Louisville",
x="Date",
y="Number of trips")+
plotTheme()

Let's focus on the research period, June and July. The scooter trips clearly follow a weekly periodic pattern, and the curves within each day resemble one another.
ggplot(subset(ride.template, interval60>="2019-06-10 00:00:00" & interval60<"2019-08-05 00:00:00") %>%
group_by(interval60) %>%
tally())+
geom_line(aes(x = interval60, y = n), size=1)+
labs(title="Scooter trips per hr. Louisville",
x="Date",
y="Number of trips")+
plotTheme()

Narrowing the time scale to a single day: in each fishnet cell, the number of trips is generally below 2 per hour. There are fewer trips in the morning and overnight than at noon and in the afternoon, suggesting that people prefer to use scooters after work rather than during it.
ride.template %>%
mutate(time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
group_by(interval60, Origin.fishnetID, time_of_day) %>%
tally()%>%
group_by(Origin.fishnetID, time_of_day)%>%
summarize(mean_trips = mean(n))%>%
ggplot()+
geom_histogram(aes(mean_trips), binwidth = 0.2)+
labs(title="Mean Number of Hourly Trips Per Fishnet. Louisville",
x="Number of trips",
y="Frequency")+
facet_wrap(~time_of_day)+
plotTheme()

The figure below shows scooter trip counts by day of week. Consider first the pink line for Saturday: it has the highest values through almost the whole day, indicating that people prefer scooters for weekend travel, and its curve is smooth because weekend schedules are not uniform. The weekday curves differ: between 11 am and 4 pm they dip into a concave shape, likely because people are usually at work during those hours.
ggplot(ride.template %>% mutate(hour = hour(Start_Datetime)))+
geom_freqpoly(aes(hour, color = dotw), binwidth = 1, size = 1)+
labs(title="Scooter trips count in Louisville, by day of the week",
x="Hour",
y="Trip Counts")+
xlim(0,24)+
scale_colour_manual(values = palette7)+
plotTheme()

In addition, the daily trend shows no obvious AM rush, only a single high peak in the afternoon. This differs from traditional transportation modes and suggests that people do not choose scooters for their morning commute.
ggplot(ride.template %>% mutate(hour = hour(Start_Datetime)))+
geom_histogram(aes(hour), binwidth = 1, size = 1,color="white", fill="#FE7568")+
labs(title="Scooter trips count in Louisville by hour",
x="Hour",
y="Trip Counts")+
xlim(0,24) +
scale_colour_manual(values = palette7)+
plotTheme()

What follows is the trip duration distribution. Most trips last around four to five minutes. At an average scooter speed of 12 mph, one mile takes 5 minutes, which matches the most common trip duration. This result supports our assumption that scooters serve last-mile travel.
ggplot(subset(ride.template, TripDuration<40))+
geom_freqpoly(aes(TripDuration, color = dotw), binwidth = 1, size = 1)+
labs(title="Scooter trips duration in Louisville, by day of the week",
x="Duration",
y="Trip Counts")+
scale_colour_manual(values = palette7)+
plotTheme()

The final time-based analysis concerns time lags. We plotted trip count as a function of each time lag. As the figure shows, all time lags except the 12-hour lag are positively correlated with the dependent variable, and the absolute correlations of all time lags are greater than 0.4.
plotData.lag <-
as.data.frame(ride.panel.start) %>%
filter(week == 29) %>%
group_by(interval60) %>%
summarise_at(vars(starts_with("lag"), "Trip_Count"), mean, na.rm = TRUE) %>%
gather(Variable, Value, -interval60, -Trip_Count) %>%
mutate(Variable = factor(Variable, levels=c("lagHour","lag2Hours","lag3Hours","lag4Hours",
"lag12Hours","lag1day")))
correlation.lag <-
plotData.lag %>%
group_by(Variable) %>%
summarize(correlation = round(cor(Value, Trip_Count),2))
plotData.lag %>%
ggplot(aes(Value, Trip_Count)) +
geom_point() + geom_smooth(method = "lm", se = F) +
facet_wrap(~Variable) +
geom_text(data=correlation.lag, aes(label=paste("R =", correlation)),colour = "blue",
x=-Inf, y=Inf, hjust=-0.1, vjust=1.2) +
labs(title = "Rideshare trip count as a function of time lags"
) +
plotTheme()
3.2 Trip variation across space
We mapped the start and end points of the scooter trips. Comparing the two, the distribution of trip ends is wider than that of trip starts. This is natural: riders scatter outward from the center, but in remote areas a scooter is rarely found and ridden back downtown.
grid.arrange(
ride.panel.start %>%
group_by(Origin.fishnetID) %>%
summarize(Sum_Trip_Count = sum(Trip_Count)) %>%
ggplot() + geom_sf(aes(fill = log(Sum_Trip_Count+1))) +
scale_fill_viridis() +
labs(title="Sum of scooter trip starting points by fishnet") +
mapTheme() + theme(legend.position = "bottom"),
ride.panel.end %>%
group_by(Destination.fishnetID) %>%
summarize(Sum_Trip_Count = sum(Trip_Count)) %>%
ggplot() + geom_sf(aes(fill = log(Sum_Trip_Count+1))) +
scale_fill_viridis() +
labs(title="Sum of scooter trip ending points by fishnet") +
mapTheme() + theme(legend.position = "bottom"),ncol = 2)
The following figure maps ridership by day of the week. On Saturdays and Sundays, trips are more dispersed and cover a wider area, suggesting that people tend to travel farther from the city center on weekends.
ride.panel.start %>%
group_by(dotw, Origin.fishnetID) %>%
summarize(Sum_Trip_Count = sum(Trip_Count)) %>%
ggplot() + geom_sf(aes(fill = log(Sum_Trip_Count+1))) +
facet_wrap(~dotw, ncol = 4) +
scale_fill_viridis() +
labs(title="Sum of rideshare trips by tract and day of the week") +
mapTheme() + theme(legend.position="bottom")
These 24 panels show the detailed distribution of trips for each hour of the day. The hourly pattern echoes the earlier trip counts by hour chart.
ride.panel.start %>%
group_by(hour = hour(interval60), Origin.fishnetID) %>%
summarize(Sum_Trip_Count = sum(Trip_Count)) %>%
ggplot() + geom_sf(aes(fill = log(Sum_Trip_Count+1))) +
facet_wrap(~hour, ncol = 6) +
scale_fill_viridis() +
labs(title="Sum of rideshare trips by fishnet and hour of the day") +
mapTheme() + theme(legend.position = "bottom")
3.3 Weather
The first plot is a bar chart of precipitation and trip counts, showing the mean trip count by precipitation status for each of eight weeks. Except for week 28, trip counts are higher in dry weather than in rain, indicating that people prefer to use scooters on dry days. Week 28 may be an outlier or reflect special circumstances. Overall, rainfall is inversely related to trip count.
as.data.frame(ride.panel.start) %>%
group_by(interval60) %>%
summarize(Trip_Count = sum(Trip_Count)) %>%
left_join(weather.Panel) %>%
mutate(isPrecip = ifelse(Precipitation > 0,"Rain/Snow", "None")) %>%
group_by(week = week(ymd_hms(interval60)+60*60*24), isPrecip) %>%
summarize(Mean_Trip_Count = mean(Trip_Count)) %>%
ggplot(aes(isPrecip, Mean_Trip_Count)) +
geom_bar(stat = "identity") +
facet_wrap(~week,ncol=9) +
labs(title="Relationship between ridership and rain or snow",
subtitle="Mean trip count by week",
x="Precipitation", y="Mean Trip Count") +
plotTheme() + theme(axis.text.x = element_text(angle = 45, hjust = 1))
The figure below explores the relationship between temperature and trip count. Across every week of the study period, scooter trip counts were positively correlated with temperature, in line with our hypothesis that higher temperatures make people more willing to use scooters.
as.data.frame(ride.panel.start) %>%
group_by(interval60) %>%
summarize(Trip_Count = sum(Trip_Count)) %>%
left_join(weather.Panel) %>%
mutate(week = week(ymd_hms(interval60)+60*60*24)) %>%
ggplot(aes(Temperature, Trip_Count)) +
geom_point() + facet_wrap(~week) + geom_smooth(method = "lm", se= FALSE) +
plotTheme() +
labs(title="Trip Count as a function of Temperature by week",
subtitle="Trip count by weeks: 06/10/2019-08/05/2019",
x="Temperature", y="Mean Trip Count")
3.5 City facilities data
Finally, let’s explore the spatial distribution of city facilities and their relationship with scooter trips. The facility maps show where facilities are located within the study area. The nearest-neighbor maps show the distance to the nearest facilities, which makes spatial patterns easier to interpret than the count maps.
vars_net.long.nn <-
vars_net %>%
gather(Variable, value, -geometry, -uniqueID)
vars <- unique(vars_net.long.nn$Variable)
mapList <- list()
for(i in vars){
mapList[[i]] <-
ggplot() +
geom_sf(data = filter(vars_net.long.nn, Variable == i), aes(fill=value), colour=NA) +
scale_fill_viridis(name="") +
labs(title=i) +
mapTheme()}
do.call(grid.arrange,c(mapList, ncol =3, top = "Risk Factors and Positive Factors with Count or Nearest Neighbor"))
The correlation figures show that the count predictors are more strongly related to trip counts than their nearest-neighbor counterparts. The count maps are more spatially concentrated, while the nearest-neighbor maps are smoother. Because trips are likewise concentrated in the north, the count predictors, which share that pattern, correlate more strongly.
plotData.census <-
as.data.frame(ride.panel.start) %>%
filter(week == 30) %>%
group_by(Origin.fishnetID) %>%
summarize(Trip_Count = sum(Trip_Count)) %>%
left_join(vars_net, by=c("Origin.fishnetID"="uniqueID"))
correlation.long <-
plotData.census%>%
dplyr::select(-Origin.fishnetID, -geometry) %>%
gather(Variable, Value, -Trip_Count)
correlation.cor <-
correlation.long %>%
group_by(Variable) %>%
summarize(correlation = cor(Value, Trip_Count, use = "complete.obs"))
ggplot(correlation.long, aes(Value, Trip_Count)) +
geom_point(size = 0.1) +
geom_text(data = correlation.cor, aes(label = paste("r =", round(correlation, 2))),
x=-Inf, y=Inf, vjust = 1, hjust = -.1) +
geom_smooth(method = "lm", se = FALSE, colour = "#a2d7d8") +
facet_wrap(~Variable, ncol = 2, scales = "free") +
ylim(0,1000)+
labs(title = "Scooter trips count as a function of city facilities (count/nearest distance)")
4 Prediction Model
4.1 Training set and test set
In this chapter, we build machine learning models to predict scooter trips for a given place and time. We split the data 75/25 into training and test sets: weeks 24-29 form the training set and weeks 30-31 the test set. The output plots the training set in dark blue and the test set in light blue.
ride.start.Train <- filter(ride.panel.start, week <30)
ride.start.Test <- filter(ride.panel.start, week >=30)
ride.end.Train <- filter(ride.panel.end, week <30)
ride.end.Test <- filter(ride.panel.end, week >=30)
#borrow
mondayMidnight <-
ride.panel.start %>%
mutate(mondayMidnight = ifelse(wday(interval60) == 2 & hour(interval60) == 1,
as.POSIXct(interval60),0)) %>%
filter(mondayMidnight != 0)
st_set_geometry(
rbind(
mutate(ride.start.Train, label = "Training"),
mutate(ride.start.Test, label = "Testing")), NULL) %>%
group_by(label, interval60) %>%
summarize(Trip_Count = sum(Trip_Count)) %>%
ggplot(aes(interval60, Trip_Count, colour = label)) +
geom_line() +
ylim(0,400) +
labs(title="Scooter trips in Louisville by week: June 10th to August 5th, 2019",
subtitle="", x="Day", y="Trip Counts") +
plotTheme() + theme(panel.grid.major = element_blank()) +
scale_colour_manual(values = palette2) +
geom_vline(data = mondayMidnight, aes(xintercept = mondayMidnight), colour="black")
4.2 Prediction Model
In this section, five ordinary least squares regressions are estimated on the scooter trip data. The first focuses on time effects only: hour fixed effects, day-of-the-week fixed effects, and temperature, with no lag effects. The second replaces the time fixed effects with origin fishnet ID fixed effects, substituting spatial information for temporal information. The third includes both time and space fixed effects. The fourth adds the time-lag features, and the fifth adds city facilities data such as bus stops and police stations.
reg1 <-
lm(Trip_Count ~ hour(interval60) + dotw + Temperature, data=ride.start.Train)
reg2 <-
lm(Trip_Count ~ Origin.fishnetID + dotw + Temperature, data=ride.start.Train)
reg3 <-
lm(Trip_Count ~ Origin.fishnetID + hour(interval60) + dotw + Temperature ,
data=ride.start.Train)
reg4 <-
lm(Trip_Count ~ Origin.fishnetID + hour(interval60) + dotw + Temperature +
lagHour + lag2Hours +lag3Hours + lag12Hours + lag1day,
data=ride.start.Train)
reg5 <-
lm(Trip_Count ~ Origin.fishnetID + hour(interval60) + dotw + Temperature +
lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day + bus_stops +construction_Permits +police_station.nn,
data=ride.start.Train)
The MAE of each model is plotted by week below. The results show that city facilities and other socio-economic factors do not improve model performance. Therefore, to keep the model small and easy to apply, we do not use city facility factors as predictors in the subsequent analysis.
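MAE, the metric used below, is simply the mean of the absolute prediction errors; a quick illustrative sketch with made-up numbers (the real computation appears in the next chunk as `Absolute_Error` and `MAE`):

```r
# Mean absolute error: mean of |observed - predicted|
mae <- function(observed, predicted) mean(abs(observed - predicted))
mae(c(10, 20, 30), c(12, 18, 33))  # (2 + 2 + 3) / 3, about 2.33
```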
ride.start.Test.weekNest <-
ride.start.Test %>%
nest(-week)
model_pred <- function(dat, fit){
pred <- predict(fit, newdata = dat)}
week_predictions.start <-
ride.start.Test.weekNest %>%
mutate(A_Time_FE = map(.x = data, fit = reg1, .f = model_pred),
B_Space_FE = map(.x = data, fit = reg2, .f = model_pred),
C_Time_Space_FE = map(.x = data, fit = reg3, .f = model_pred),
D_Time_Space_FE_timeLags = map(.x = data, fit = reg4, .f = model_pred),
E_Time_Space_FE_timeLags_other = map(.x = data, fit = reg5, .f = model_pred))
week_predictions.start <-
week_predictions.start %>%
gather(Regression, Prediction, -data, -week) %>%
mutate(Observed = map(data, pull, Trip_Count),
Absolute_Error = map2(Observed, Prediction, ~ abs(.x - .y)),
MAE = map_dbl(Absolute_Error, mean),
sd_AE = map_dbl(Absolute_Error, sd))
ride.start.Test.weekNest %>%
mutate(Trip_Count = map(data, pull, Trip_Count),
Mean_Trip_Count = map_dbl(Trip_Count, mean))
week_predictions.start %>%
dplyr::select(week, Regression, MAE) %>%
gather(Variable, MAE, -Regression, -week) %>%
ggplot(aes(week, MAE)) +
geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
scale_fill_manual(values = palette5) +
labs(title = "Mean Absolute Errors by model specification and week") +
plotTheme()
For each model, predicted and observed scooter trips are plotted as time series below. Both are averaged by interval60, removing the spatial context. With increasing model sophistication, predictions improve, and it is again clear that city facilities and other socio-economic factors do not add accuracy. The model does have a shortcoming: it is not accurate enough at the peak travel hours.
week_predictions.start %>%
mutate(interval60 = map(data, pull, interval60),
Origin.fishnetID = map(data, pull, Origin.fishnetID)) %>%
dplyr::select(interval60, Origin.fishnetID, Observed, Prediction, Regression) %>%
unnest() %>%
gather(Variable, Value, -Regression, -interval60, -Origin.fishnetID) %>%
group_by(Regression, Variable, interval60) %>%
summarize(Value = mean(Value)) %>%
ggplot(aes(interval60, Value, colour=Variable)) +
geom_line(size = 1.1) +
facet_wrap(~Regression, ncol=1) +
scale_colour_manual(values = palette2) +
labs(title = "Mean Predicted/Observed scooter trips by hourly interval",
subtitle = "Louisville; A test set of 2 weeks in July and August", x = "Hour", y= "Rideshare Trips") +
plotTheme()
Next, the MAE of regression 4 is mapped by fishnet cell to show how error relates to spatial context. The highest errors occur in the downtown area, where most scooter trips take place.
filter(week_predictions.start, Regression == "D_Time_Space_FE_timeLags") %>%
unnest() %>%
st_sf() %>%
dplyr::select(Origin.fishnetID, Absolute_Error, geometry) %>%
gather(Variable, Value, -Origin.fishnetID, -geometry) %>%
group_by(Variable, Origin.fishnetID) %>%
summarize(MAE = mean(Value)) %>%
ggplot() +
geom_sf(aes(fill = MAE)) +
scale_fill_viridis() +
labs(title="Mean Absolute Error Spatial Distribution") +
mapTheme() + theme(legend.position="bottom")
The following graphs map the hourly MAE for the Monday of week 31. The spatial distribution of MAE is largely consistent throughout the day. The error at 3 PM is relatively large because 3 PM is the scooter peak hour. Although trip counts at midnight are small, the midnight error is also large, since the spatial distribution of travel at midnight differs from other hours.
filter(week_predictions.start, Regression == "D_Time_Space_FE_timeLags") %>%
unnest() %>%
st_sf() %>%
dplyr::select(Origin.fishnetID, Absolute_Error, geometry, interval60) %>%
gather(Variable, Value, -interval60, -Origin.fishnetID, -geometry) %>%
filter(wday(interval60, label = TRUE) == "Mon" & week(ymd_hms(interval60)+60*60*24) == 31) %>%
group_by(hour = hour(interval60), Origin.fishnetID) %>%
summarize(MAE = mean(Value)) %>%
ggplot() +
geom_sf(aes(fill = MAE)) +
facet_wrap(~hour, ncol = 4) +
scale_fill_viridis() +
labs(title="Mean absolute trip count error by hour",
subtitle = "For the Monday of Week 31") +
mapTheme() + theme(legend.position="bottom")
Then we check whether the error is related to socio-economic factors. The relationships are not significant.
error_plot <-
filter(week_predictions.start, Regression == "D_Time_Space_FE_timeLags") %>%
unnest() %>%
dplyr::select(Origin.fishnetID, Absolute_Error) %>%
gather(Variable, Value, -Origin.fishnetID) %>%
group_by(Variable, Origin.fishnetID) %>%
summarize(MAE = mean(Value))%>%
left_join(st_set_geometry(fishnet_tract, NULL),by=c("Origin.fishnetID"="uniqueID")) %>%
left_join(LouisvilleCensus,by=c("GEOID"="GEOID"))
error_plot = na.omit(error_plot)
p1 <- ggplot(error_plot, aes(x=MAE, y=Percent_White)) +
geom_point() +
geom_smooth(method=lm , color="red", se=FALSE) +
plotTheme()
p2 <- ggplot(error_plot, aes(x=MAE, y=Med_Inc)) +
geom_point() +
geom_smooth(method=lm , color="red", se=FALSE) +
plotTheme()
p3 <- ggplot(error_plot, aes(x=MAE, y=Total_Public_Trans)) +
geom_point() +
geom_smooth(method=lm , color="red", se=FALSE) +
plotTheme()
figure <- ggarrange(p1,p2,p3,ncol = 3)
annotate_figure(figure,
top = text_grob("Error as function of Social-economic Factors", face = "bold", size = 14))
4.3 Prediction Model of Scooter Returns
In this section, we use the same method to predict demand for scooter returns. From both time and space perspectives, the results closely resemble those for trip starts: the model is not accurate enough at peak travel hours, and the highest errors occur in the downtown area, where most scooter trips happen.
reg_end <-
lm(Trip_Count ~ Destination.fishnetID + hour(interval60) + dotw + Temperature +
lagHour + lag2Hours +lag3Hours + lag12Hours + lag1day,
data=ride.end.Train)
ride.end.Test.weekNest <-
ride.end.Test %>%
nest(-week)
week_predictions.end <-
ride.end.Test.weekNest %>%
mutate(End_result = map(.x = data, fit = reg_end, .f = model_pred))
week_predictions.end <-
week_predictions.end %>%
gather(Regression, Prediction, -data, -week) %>%
mutate(Observed = map(data, pull, Trip_Count),
Absolute_Error = map2(Observed, Prediction, ~ abs(.x - .y)),
MAE = map_dbl(Absolute_Error, mean),
sd_AE = map_dbl(Absolute_Error, sd))
week_predictions.end[is.na(week_predictions.end)] <- 0
week_predictions.end %>%
mutate(interval60 = map(data, pull, interval60),
Destination.fishnetID = map(data, pull, Destination.fishnetID)) %>%
dplyr::select(interval60, Destination.fishnetID, Observed, Prediction, Regression) %>%
unnest() %>%
gather(Variable, Value, -Regression, -interval60, -Destination.fishnetID) %>%
group_by(Regression, Variable, interval60) %>%
summarize(Value = mean(Value)) %>%
ggplot(aes(interval60, Value, colour=Variable)) +
geom_line(size = 1.1) +
facet_wrap(~Regression, ncol=1) +
scale_colour_manual(values = palette2) +
labs(title = "Mean Predicted/Observed ride share by hourly interval",
subtitle = "Louisville; A test set of 2 weeks in July and August", x = "Hour", y= "Rideshare Trips") +
plotTheme()
filter(week_predictions.end, Regression == "End_result") %>%
unnest() %>%
st_sf() %>%
dplyr::select(Destination.fishnetID, Absolute_Error, geometry) %>%
gather(Variable, Value, -Destination.fishnetID, -geometry) %>%
group_by(Variable, Destination.fishnetID) %>%
summarize(MAE = mean(Value)) %>%
ggplot() +
geom_sf(aes(fill = MAE)) +
scale_fill_viridis() +
labs(title="Mean Absolute Error Spatial Distribution") +
mapTheme() + theme(legend.position="bottom")
4.4 Demand prediction and deployment
We have now predicted both the starts and ends of scooter trips. The following figure compares observed and predicted results, and compares start demand with end demand. End demand is more dispersed than start demand, especially in West Louisville. For both, predictions in suburban areas exceed the observed values.
grid.arrange(
filter(week_predictions.start, Regression == "D_Time_Space_FE_timeLags") %>%
unnest() %>%
st_sf() %>%
dplyr::select(Origin.fishnetID, Observed, geometry) %>%
gather(Variable, Value, -Origin.fishnetID, -geometry) %>%
group_by(Variable, Origin.fishnetID) %>%
summarize(Log_demand = log(sum(Value)+1)) %>%
ggplot() +
geom_sf(aes(fill = Log_demand)) +
scale_fill_viridis() +
labs(title="Observed Demand of Start") +
mapTheme() + theme(legend.position="bottom"),
filter(week_predictions.end, Regression == "End_result") %>%
unnest() %>%
st_sf() %>%
dplyr::select(Destination.fishnetID, Observed, geometry) %>%
gather(Variable, Value, -Destination.fishnetID, -geometry) %>%
group_by(Variable, Destination.fishnetID) %>%
summarize(Log_demand = log(sum(Value)+1)) %>%
ggplot() +
geom_sf(aes(fill = Log_demand)) +
scale_fill_viridis() +
labs(title="Observed Demand of End") +
mapTheme() + theme(legend.position="bottom"),
filter(week_predictions.start, Regression == "D_Time_Space_FE_timeLags") %>%
unnest() %>%
st_sf() %>%
dplyr::select(Origin.fishnetID, Prediction, geometry) %>%
gather(Variable, Value, -Origin.fishnetID, -geometry) %>%
group_by(Variable, Origin.fishnetID) %>%
summarize(Log_demand = log(sum(Value)+1)) %>%
ggplot() +
geom_sf(aes(fill = Log_demand)) +
scale_fill_viridis() +
labs(title="Predicted Demand of Start") +
mapTheme() + theme(legend.position="bottom"),
filter(week_predictions.end, Regression == "End_result") %>%
unnest() %>%
st_sf() %>%
dplyr::select(Destination.fishnetID, Prediction, geometry) %>%
gather(Variable, Value, -Destination.fishnetID, -geometry) %>%
group_by(Variable, Destination.fishnetID) %>%
summarize(Log_demand = log(sum(Value)+1)) %>%
ggplot() +
geom_sf(aes(fill = Log_demand)) +
scale_fill_viridis() +
labs(title="Predicted Demand of End") +
mapTheme() + theme(legend.position="bottom"),
ncol = 2)
By subtracting start demand from end demand, we obtain the deployment need of each region. A negative result means the area needs more scooters deployed into it; a positive result means the area accumulates idle scooters. The figure below shows that the main imbalance occurs downtown and just east of downtown: the city center is short of scooters, while unused scooters pile up to its east. We suspect the cause is that many people park east of the city center and walk in; after work or entertainment, they ride a scooter out of the center due to fatigue or other reasons.
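The subtraction described above can be illustrated with made-up numbers for two hypothetical cells:

```r
# Net deployment need per cell: predicted end demand minus start demand.
starts <- c(downtown = 50, east = 20)  # predicted trip starts (hypothetical)
ends   <- c(downtown = 30, east = 35)  # predicted trip ends (hypothetical)
net    <- ends - starts
net  # downtown: -20 (needs scooters deployed in); east: +15 (idle surplus)
```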
PDE <-
filter(week_predictions.end, Regression == "End_result") %>%
unnest() %>%
st_sf() %>%
dplyr::select(Destination.fishnetID, Prediction, geometry) %>%
gather(Variable, Value, -Destination.fishnetID, -geometry) %>%
group_by(Variable, Destination.fishnetID) %>%
summarize(Log_demand = log(sum(Value)+1))
PDE[is.na(PDE)] <- 0
result_Observed_start <-
filter(week_predictions.start, Regression == "D_Time_Space_FE_timeLags") %>%
unnest() %>%
st_sf() %>%
dplyr::select(Origin.fishnetID, Observed, geometry) %>%
gather(Variable, Value, -Origin.fishnetID, -geometry) %>%
group_by(Variable, Origin.fishnetID) %>%
summarize(Observed_start = sum(Value))
result_Observed_end <-
filter(week_predictions.end, Regression == "End_result") %>%
unnest() %>%
st_sf() %>%
dplyr::select(Destination.fishnetID, Observed, geometry) %>%
gather(Variable, Value, -Destination.fishnetID, -geometry) %>%
group_by(Variable, Destination.fishnetID) %>%
summarize(Observed_end = sum(Value))
result_Predicted_start <-
filter(week_predictions.start, Regression == "D_Time_Space_FE_timeLags") %>%
unnest() %>%
st_sf() %>%
dplyr::select(Origin.fishnetID, Prediction, geometry) %>%
gather(Variable, Value, -Origin.fishnetID, -geometry) %>%
group_by(Variable, Origin.fishnetID) %>%
summarize(Predicted_start = sum(Value))
result_Predicted_end <-
filter(week_predictions.end, Regression == "End_result") %>%
unnest() %>%
st_sf() %>%
dplyr::select(Destination.fishnetID, Prediction, geometry) %>%
gather(Variable, Value, -Destination.fishnetID, -geometry) %>%
group_by(Variable, Destination.fishnetID) %>%
summarize(Predicted_end = sum(Value))
result_all <-
st_set_geometry(result_Observed_start, NULL) %>%
left_join(st_set_geometry(result_Observed_end, NULL),by=c("Origin.fishnetID"="Destination.fishnetID")) %>%
left_join(st_set_geometry(result_Predicted_start, NULL), by=c("Origin.fishnetID" = "Origin.fishnetID")) %>%
left_join(st_set_geometry(result_Predicted_end, NULL), by=c("Origin.fishnetID"="Destination.fishnetID"))%>%
left_join(fishnet,by=c("Origin.fishnetID"="uniqueID")) %>% st_sf()
result_all$Observed_demand = result_all$Observed_end - result_all$Observed_start
result_all$Predicted_demand = result_all$Predicted_end - result_all$Predicted_start
result_all$Error = result_all$Predicted_demand - result_all$Observed_demand
grid.arrange(
result_all %>%
ggplot() +
geom_sf(aes(fill = Observed_demand)) +
scale_fill_viridis() +
labs(title="Observed_demand") +
mapTheme() + theme(legend.position="bottom"),
result_all %>%
ggplot() +
geom_sf(aes(fill = Predicted_demand)) +
scale_fill_viridis() +
labs(title="Predicted_demand") +
mapTheme() + theme(legend.position="bottom"),
result_all %>%
ggplot() +
geom_sf(aes(fill = Error)) +
scale_fill_viridis() +
labs(title="Error") +
mapTheme() + theme(legend.position="bottom"),ncol = 3)
5 App Wireframe
Based on our difference map, we have designed two applications to assist the management of scooters. The first is a web app, whose main users are scooter companies. The second is a mobile app, whose main users are scooter riders.
5.1 Web App
The web app is divided into two interfaces. In the first, the main map is a rebalancing map, which recommends rebalancing schemes between areas based on the difference map. To the right of the map is a table of predicted starts, ends, and differences. Below are two line plots and a pie chart providing related forecast information.
The second interface displays the key exploratory analysis results. Users can freely choose the type and date of the map they want to view; the fishnet display provides a rasterized map experience.
5.2 Mobile App
The key to our mobile app is its reward mechanism. The left interface is a 2D map showing that when a user rides from a scooter-surplus area to a scooter-scarce area, the trip fee is reduced. The right interface renders the same trip in 3D: the ride is presented as climbing from low to high altitude, like climbing a mountain, and gaining altitude earns rewards. This “altitude” reward mechanism encourages users to help the scooter system balance itself.
6 Conclusion
This project uses a space-time model to predict the starts and ends of scooter trips in Louisville and generates a net difference map to inform scooter companies’ rebalancing plans. The final difference map shows some extremely imbalanced areas downtown. Comparing against Google Maps, the largest scooter-surplus area is the East Market District and the largest scooter-scarce area is the Downtown Center, and they are adjacent. Our hypothesis is that traffic in the downtown center is inconvenient and congested: people who work or shop there avoid the congestion by riding scooters to an adjacent block and then taking public transportation. At the same time, our model performs poorly downtown, where error is high. This is because trips are concentrated and numerous downtown, making the peaks difficult to predict.
Several steps could improve the algorithm in the future: replace the space-time model with a long short-term memory (LSTM) network, which uses entire sequences of data rather than a few lags; shrink the study area to exclude the many empty cells around the edges; and use more than two months of data to build a more reliable model.
Overall, the space-time model considers a variety of time-space feature combinations and achieves good results for scooter trip prediction. We believe its results can support the development of real applications.
If you are interested in our project, please check out the video for more details.
3.4 Social-economic factors
Although space-time panels replaced census data in the regression analysis, an exploratory look at socio-economic characteristics is still feasible. The results show that the absolute correlation between median income and scooter trips is relatively large, while the other factors correlate only weakly with scooter trips.
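The chunk behind this section is not shown; a minimal sketch of the correlation check it describes, using a toy data frame rather than the project’s `LouisvilleCensus` variables:

```r
# Illustrative only: correlate cell-level trip counts with socio-economic
# variables; in the project these values come from census data.
cells <- data.frame(
  Trip_Count    = c(120, 80, 45, 200, 15),
  Med_Inc       = c(42000, 55000, 61000, 38000, 70000),
  Percent_White = c(0.60, 0.72, 0.80, 0.55, 0.85)
)
cors <- sapply(cells[c("Med_Inc", "Percent_White")],
               function(v) cor(v, cells$Trip_Count))
round(cors, 2)  # in this toy data, Med_Inc has a large negative correlation
```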